Abstract. A key objective of multidimensional dataset analysis is to
reveal patterns of interest to users, but such analysis can be difficult to conduct due
to the challenges of both presenting and navigating large datasets. This
work  explores  how  initial  summarizations  of  multidimensional  datasets 
can  be  generated  (designed  to  reduce  the  number  of  data  points  which 
would  need  to  be  displayed),  using  summarization  policies  based  on 
provided  dataset  values.  Additionally,  functionality  for  explaining  the 
derivation  of  summarizations  is  being  designed  in  line  with  prior  work 
on  aiding  analyst  interactions  with  data  processing  systems.  To  help 
drive  development  of  this  work,  as  well  as  provide  illustrative  use  cases, 
we  are  presently  designing  a  dataset  summarization  generator  as  part  of 
greater  work  being  done  on  an  infrastructure  for  managing  evidence  of 
technical  emergence  in  varying  research  disciplines  via  automated  review 
of  published  materials. 
1  Introduction

A  key  objective  of  multidimensional  dataset  analysis  is  to  reveal  patterns  of 
interest  to  analysts.  In  many  cases,  these  analyses  will  involve  navigation  over 
a dataset to expose content likely to have interesting patterns. However,
multidimensional analysis has been observed to be challenging to analysts for the
following  reasons  [1]: 

1.  They  may  be  overwhelmed  by  a  data  space  evidence  set  if  it  is  too  large. 

2.  They  may  not  have  time  or  expertise  to  perform  extensive  navigation. 

This  work  explores  how  initial  summarizations  of  multidimensional  datasets 
can  be  generated  for  consuming  parties  (designed  to  reduce  the  number  of  data 
points  which  would  need  to  be  displayed)  driven  by  summarization  policies 
based  on  provided  dataset  values.  Focus  has  been  given  to  RDF-based  dataset 
encodings, due largely to RDF's flexibility in linking to outside data sources
(e.g., ontologies for expressing possible data values). Finally, functionality for
explaining the derivation of summarizations is being developed, in line with
prior work for aiding analyst interactions with data processing systems [2].
2  Evidence  Summarization  in  the  ARBITER  System

To  help  drive  development  of  this  work,  as  well  as  provide  illustrative  use  cases, 
we  are  presently  developing  a  dataset  summarization  generator  for  the  Abductive 
Reasoning Based on Indicators and Topics of EmeRgence (ARBITER) system,
which is being jointly developed by Rensselaer, BAE Systems, NYU, Brandeis and 1790
Analytics as part of IARPA's Foresight and Understanding from Scientific
Exposition (FUSE) program. ARBITER's design objective is to scan for signs of
technical emergence in published literature, where technical emergence is
defined in the FUSE program as [3]: the process by which research domains appear,
mature, and if conditions are favorable, make a significant impact.

In  ARBITER,  sets  of  one  or  more  evidence  entries  are  evaluated  to  make 
hypotheses  about  emergence-related  questions  for  a  given  topic  and  time  period. 
For  example:  Has  a  practical  application  for  DNA  Microarrays  been  established 
in  the  time  period  of  2006-2010,  based  on  the  document  collection  PubMed-f2? 

In this setting, evidence entries are defined as emergence indicators,
calculated based on analysis over document collections. Indicators are classified
according to an OWL ontology of indicator types, where each indicator is defined
to have at least one RDF type, as well as a set of numerical scoring metrics that
define the relationship of evidence to hypothesis. For brevity, an example is provided
with five indicators, each with a single RDF type and two numerical properties
(value and relevance to the question answer, where a higher value is better).


Indicator
  FunderCount
    CommercialFunderCount
    GovernmentFunderCount
    UniversityFunderCount
  PatentGrowthRate
  ResearcherCount

Indicator Type                                Value   Relevance
Count of Commercial Funding Agencies          19      0.70
Count of Government-based Funding Agencies    82      0.58
Count of University-based Funding Agencies    5       0.42
Growth Rate for Assigned Patents              1.09    0.68
Count of Participating Researchers            518     0.53

Fig. 1. Left: Part of ARBITER's ontology, which defines indicators. Right: A sample of
indicator data. Here, indicators are aligned with their corresponding ontology classes.


Currently, these evidence entries are presented as a 2-dimensional
spreadsheet. To reduce the number of rows directly presented, policy-based
summarization techniques are being explored, deriving from established navigation
techniques in OLAP [1]: grouping rows into collection-based entries, as well as
filtering table entries, each based on specified criteria. For this submission, the
following two summarization policies are provided for illustrative purposes:
1. Grouping: Group entries together that are SKOS (footnote 1) subconcepts of the
"FunderCount" class.

2.  Filtering:  Remove  entries  with  relevance  scores  below  0.55. 
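The two policies can be illustrated with a minimal Python sketch over the indicator data of Figure 1. This is not ARBITER's implementation: the `Entry` class, the `BROADER` mapping (a hard-coded stand-in for the SKOS hierarchy) and both function names are hypothetical, and the choice of SUM for values and AVERAGE for relevance when merging a group is taken from the explanation shown in Figure 2.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Entry:
    indicator_type: str  # ontology class of the indicator
    value: float
    relevance: float


# Indicator data from Figure 1.
EVIDENCE = [
    Entry("CommercialFunderCount", 19, 0.70),
    Entry("GovernmentFunderCount", 82, 0.58),
    Entry("UniversityFunderCount", 5, 0.42),
    Entry("PatentGrowthRate", 1.09, 0.68),
    Entry("ResearcherCount", 518, 0.53),
]

# Assumption: a hard-coded child -> parent map standing in for the
# SKOS subconcept relations that the ontology would provide.
BROADER = {
    "CommercialFunderCount": "FunderCount",
    "GovernmentFunderCount": "FunderCount",
    "UniversityFunderCount": "FunderCount",
}


def group_by_parent(entries, parent):
    """Grouping policy: merge all subconcepts of `parent` into one entry."""
    members = [e for e in entries if BROADER.get(e.indicator_type) == parent]
    if not members:
        return list(entries)
    merged = Entry(
        parent,
        sum(e.value for e in members),       # SUM of member values
        mean(e.relevance for e in members),  # AVERAGE of member relevance
    )
    rest = [e for e in entries if e not in members]
    return rest + [merged]


def filter_by_relevance(entries, threshold):
    """Filtering policy: remove entries with relevance below `threshold`."""
    return [e for e in entries if e.relevance >= threshold]


grouped = group_by_parent(EVIDENCE, "FunderCount")
filtered = filter_by_relevance(grouped, 0.55)
for entry in filtered:
    print(entry.indicator_type, entry.value, round(entry.relevance, 2))
```

Applied in this order, grouping reduces the five rows to three, and filtering then removes the researcher count (relevance 0.53), leaving the patent growth rate and the merged funding entry.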

Ultimately, the following system conditions are assumed: (i) A maximum
number of rows to appear in the presented summary will be specified;
(ii) A pre-defined collection of policies will be accessible by ARBITER,
along with a pre-defined ordering for their execution; (iii) Policies will be
sequentially applied to the evidence set until the maximum summary row count is
reached, or all policies have been applied. Initially, an evidence dataset D_0 will
represent content directly generated by evidence gathering routines in ARBITER.
Each policy execution will yield a transformed dataset view D_1, ..., D_n, up until
condition (iii) is satisfied.
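Conditions (i) to (iii) can be sketched as a short driver loop. The sketch below is an illustration under stated assumptions rather than ARBITER's code: `summarize`, `group_funders` and `drop_low_relevance` are hypothetical names, rows are plain tuples built from Figure 1, and "row count is reached" is read as "the current view has no more rows than the limit".

```python
# Rows are (indicator_type, value, relevance) tuples, taken from Figure 1.
D0 = [
    ("CommercialFunderCount", 19, 0.70),
    ("GovernmentFunderCount", 82, 0.58),
    ("UniversityFunderCount", 5, 0.42),
    ("PatentGrowthRate", 1.09, 0.68),
    ("ResearcherCount", 518, 0.53),
]

# Assumption: the funder subconcepts are listed by hand here.
FUNDER_TYPES = {
    "CommercialFunderCount",
    "GovernmentFunderCount",
    "UniversityFunderCount",
}


def group_funders(rows):
    """Merge funder rows into one row (SUM of values, AVERAGE of relevance)."""
    members = [r for r in rows if r[0] in FUNDER_TYPES]
    if not members:
        return list(rows)
    total = sum(r[1] for r in members)
    average = sum(r[2] for r in members) / len(members)
    rest = [r for r in rows if r[0] not in FUNDER_TYPES]
    return rest + [("FunderCount", total, average)]


def drop_low_relevance(rows):
    """Remove rows whose relevance is below 0.55."""
    return [r for r in rows if r[2] >= 0.55]


def summarize(d0, policies, max_rows):
    """Apply policies in order until the view fits max_rows or none remain.

    Returns the views [D_0, D_1, ...] and the names of the applied policies.
    """
    views = [list(d0)]
    applied = []
    for name, policy in policies:
        if len(views[-1]) <= max_rows:
            break  # row limit reached, stop early
        views.append(policy(views[-1]))
        applied.append(name)
    return views, applied


# Condition (ii): a pre-defined, ordered collection of policies.
POLICIES = [("grouping", group_funders), ("filtering", drop_low_relevance)]

views, applied = summarize(D0, POLICIES, max_rows=2)
print(applied)                  # ['grouping', 'filtering']
print([len(v) for v in views])  # [5, 3, 2]
```

With a row limit of 3 only the grouping policy would run, and with a limit of 5 the original view D_0 would be returned unchanged. Recording the names of the applied policies alongside each view is one simple way to keep the information needed for later explanations.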

While initial summarization can be a powerful aid for analyst users, care has
to be taken in its usage, since one summarization strategy may not be
appropriate for all users and information-seeking tasks. To help analysts keep track
of applied strategies, summaries will be accompanied by explanations of their
derivation, accessible for individual entries. Figure 2 provides an example summary
view along with a supporting explanation.


Challenge Question: Has a practical application for DNA Microarrays been established in the
time period of 2006-2010?

Answer to Question: Yes        Row Limit: 2

Summary of Supporting Evidence:

Why Entry
Is Present   Indicator Type                              Value   Relevance to Question
(i)          Growth Rate for Assigned Patents            1.09    0.68
(i)          Count of Funding Agencies [Expand Entry]    106     0.57

Summarization Steps Taken:
1) Grouping: All subconcepts of type "FunderCount"
2) Filtering: Entry above threshold for property
   "Relevance to Question" set at "0.55"

How property "Value" derived:
SUM of "Value" entries applied over entries from
grouping in (1)

How property "Relevance to Question" derived:
AVERAGE of "Relevance to Question" entries
applied over entries from grouping in (1)

Indicator
  FunderCount
    CommercialFunderCount
    GovernmentFunderCount
    UniversityFunderCount
  PatentGrowthRate
  ResearcherCount

Fig. 2. Here, a summary view is displayed, with a collection-based entry highlighted
in green (which aggregates the 3 funding entries from Figure 1). An explanation of
why this entry is present, and how column values were derived, can be accessed by
clicking the "i" icon.
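As a check on the collection-based entry, its two column values can be worked out from the three funding rows of Figure 1, assuming the SUM and AVERAGE derivations named in the explanation (the subscripted symbols are notation introduced here for illustration):

```latex
\begin{align*}
\text{Value}_{\text{FunderCount}} &= 19 + 82 + 5 = 106 \\
\text{Relevance}_{\text{FunderCount}} &= \frac{0.70 + 0.58 + 0.42}{3}
  = \frac{1.70}{3} \approx 0.57
\end{align*}
```

Since 0.57 is not below the 0.55 threshold, the grouped entry passes the filtering policy, whereas the researcher count (relevance 0.53) is removed. This leaves the two rows allowed by the row limit.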


1  RDF  SKOS  Vocabulary:  http://www.w3.org/TR/skos-reference/ 
System Development: ARBITER's summary generator is being designed
to take three inputs: (i) A set of fine-grained evidence; (ii) A set of SPARQL-encoded
preference policies, along with an accompanying execution order; and
(iii) Corresponding ontologies for encoding the preference and evidence data. For
encoding evidence, we are now exploring use of the RDF Data Cube (footnote 2) vocabulary,
given its support for representing multidimensional data.

Upcoming Directions: In upcoming work, focus will be given to the
following three issues: (i) selection of summarization policies which align with an
analyst's perceived preferences, (ii) based on the summarization explanations
provided, enabling analysts to tweak applied strategies to generate new
summarizations, and (iii) enabling analysts to identify source documents used to create
evidence entries (similar to efforts discussed in [2]). For situations where
significant numbers of evidence entries are presented (e.g., over 100), all three issues
are expected to need addressing.